OVERVIEW

The code in this replication package guides the reader through the 6 steps needed to recreate the analyses in the manuscript and through the process of applying the method to their own data. Readers should go through each folder in step order and open the step-specific readme file (named after its associated step). While some of the readme files are simple text files, others also serve as bash scripts that need to be run in order to move on to the next step. This is true for steps 1, 3, and 4.

Because there are two steps where hand annotations occur (steps 2 and 6), there is a dual path system set up within the step folders. The replication folders are where new attempts to replicate the results from scratch should be saved and called from, while the Manuscript folders are where the results based on the hand coding done for the manuscript can be found. If these are not split in a step, it means that the results from a full replication can be used in the next step without any hand annotation. After opening the readme/text file in each step, the user should make sure the folder paths point to the right location both when loading data and when exporting data for subsequent steps.

To generate the figures that do not depend on the final hand annotation step, run "Legistar Comparison.R" from Step 5. This generates Appendix Figures 3, 4, and 7 as well as the results discussed in Appendix A. To generate the remaining figures using the previously hand annotated data, run "Hand Coded Comparison.R" in Step 6. To hand annotate a new set, create datasets that match the ones in "Step 6/Manuscript Data/" for each model being used to extract agenda items.

The replicator should expect the code to run for at least 45 hours. The vast majority of this is training the object detection model in Step 3 five times for each of the five cities. The remaining time largely involves extracting the agenda items from the raw PDFs (found in Step 1) using the trained object detection models (found in Step 3), which is done in Step 4.

DATA AVAILABILITY AND PROVENANCE STATEMENTS

The data necessary for this analysis come from two publicly available sources. The first data source is the raw PDF documents for each meeting record for the five cities in my analysis. These PDF records can be found through each city's website and downloaded one by one. This data is located in "Step 1/Raw PDFS". The second data source is the Legistar webpage that each of these five cities maintains. These websites can be found by searching the city name followed by the word "legistar" and contain databases of all meetings and action items for the time that records were kept. To download the datasets as they appear in my analysis, the user should navigate to the home page of the Legistar site for each city, then select the legislation tab. Then the user should select "All Years" in the drop down menu directly to the right of the search bar on the page. Then, the user should click "Show" followed by "Show all records" in the drop down menu directly to the right of the item count. Finally, the user should click the "Export" button and select "Export to Excel". The downloaded versions can be found in "Step 5/Legistar Data".

DATASET LIST

The list of datasets is organized by step and includes all the datasets needed to run the code. When {City} and {Meeting Number} appear in braces, the entry applies to all five cities and to each number of meeting records per year included in the training set (1-5 meetings per year).

Step 3:

- "Step 3/Model Output/{City}/{Meeting Number}/From Manuscript/eval.csv" : Each of these csv files (of which there are 25) contains the average precision of the trained object detection model for one city, using between one and five annotated meetings per year as the training set. This data is used to generate Appendix Figure 4. The only value of importance in each file is the one in the second column and second row (cell B2), which indicates the average precision of the fine tuned model.
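
As a rough illustration of reading that value, the sketch below pulls cell B2 from a single eval.csv with Python's standard csv module. The function name is hypothetical, and the sketch assumes that cell B2 holds a plain number; neither detail is taken from the replication code.

```python
import csv

def read_average_precision(csv_path):
    """Return the value in the second row, second column (cell B2) as a float."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    # rows[1] is the second row; index 1 within it is the second column.
    return float(rows[1][1])
```

Calling this function once per eval.csv file would collect the 25 average precision values in one place.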

Step 4:

- "Step 4/Data Extraction/{City}/{Meeting Number}/{City}_raw.xlsx" : Each of these excel files contains the dataset of agenda items extracted from the raw meeting minutes records using the "Step 4/Step 4.sh" bash script. The excel file has only three columns. The first is the extracted text, the second is the pdf location and name (used to extract the day of the meeting), and the third is the confidence threshold used in the extraction process to identify agenda items. To replicate my analysis, Step 4 only needs to be run for the "Two Meetings" set of models.

Step 5:

- "Step 5/Legistar Data/{City} Legistar.xlsx" : Each of these excel files (one for each city) was generated using the process described in the Data Availability and Provenance Statements section. These excel files include a column for each piece of metadata that Legistar tracks for that city's agenda items. While there are numerous columns, the only ones used in this analysis are "MatterTitle", which contains the agenda item text, and "MatterAgendaDate", which contains the date the item appeared on the agenda. The exact definition and use of the remaining columns are unclear, as there is no available codebook for each city.

Step 6:

- "Step 6/Manuscript Data/{City} Merged.xlsx" : Each of these excel files includes four variables. The first, titled "Items", is the count of agenda items from that day. The second is the date. The third (Type) is whether the count comes from my new method or from Legistar, indicated as ODV and SLV respectively (Object Detection Version and Scraped Legistar Version). Finally, the fourth column indicates the city the data comes from.

- "Step 6/Manuscript Data/{City} Compare Text ODV.xlsx" : Each of these excel files includes four variables. The first (ODV_text) contains the agenda item text collected using my new method. The second is the date. The third is whether the item comes from my new method or from Legistar, indicated as ODV and SLV respectively (Object Detection Version and Scraped Legistar Version). Finally, the fourth column indicates whether the agenda item scraped from the meeting records can be found in the data downloaded from Legistar. This involves opening the excel sheet I discuss next and comparing by hand the agenda items for each meeting record.

- "Step 6/Manuscript Data/{City} Compare Text SLV.xlsx" : Each of these excel files includes four variables. The first (SLV_text) contains the agenda item text collected from Legistar. The second is the date. The third is whether the item comes from my new method or from Legistar, indicated as ODV and SLV respectively (Object Detection Version and Scraped Legistar Version). Finally, the fourth column indicates whether the agenda item downloaded from Legistar can be found in the data collected using my method. This involves opening the excel sheet discussed above and comparing by hand the agenda items for each meeting record.

- "Step 6/Manuscript Data/{City} Compare Words ODV.xlsx" : Each of these excel files includes five variables. The first (ODV_text) contains the agenda item text collected using my new method. The second is the date. The third is whether the item comes from my new method or from Legistar, indicated as ODV and SLV respectively (Object Detection Version and Scraped Legistar Version). The fourth column indicates whether the agenda item scraped from the meeting records can be found in the data downloaded from Legistar. This involves opening the excel sheet I discussed above and comparing by hand the agenda items for each meeting record. Finally, the fifth column (SLV_text) is the text copied directly from the SLV excel sheet being checked against and pasted into this column for every matched agenda item pair identified. This is then used to compare the lexical similarity of the matched agenda items.

- "Step 6/Manuscript Data/{City} Compare Text Hand.xlsx" : Each of these excel files includes seven variables. The first (handcoded_text) includes the agenda items copied directly from the raw meeting records in "Step 1/Raw PDFS". The second is the date. The third (version) indicates that the data is collected by hand. The fourth column (In_ODV) indicates whether the agenda item collected by hand can be found in the data collected using my method. This involves opening the excel sheet "{City} Compare Text ODV.xlsx" and comparing by hand the agenda items for each meeting record. The fifth column (ODV_text) is the text copied directly from the ODV excel sheet being checked against and pasted into this column for every matched agenda item pair identified. The sixth column (In_SLV) indicates whether the agenda item collected by hand can be found in the data downloaded from Legistar. This involves opening the excel sheet "{City} Compare Text SLV.xlsx" and comparing by hand the agenda items for each meeting record. The seventh column (SLV_text) is the text copied directly from the SLV excel sheet being checked against and pasted into this column for every matched agenda item pair identified. This is then used to compare the lexical similarity of the matched agenda items.
 
 
COMPUTATIONAL REQUIREMENTS

System requirements:

platform (x86_64-pc-linux-gnu)
OS (Ubuntu 22.04.5 LTS)
Graphics (NVIDIA Corporation GA102 [GeForce RTX 3080])

Programs Needed:

virtualenvwrapper (6.1.0)
Bash (5.1.16(1))
Docker (26.1.3)

Python (3.8.1) Packages:

pdf2image (1.17.0)
PyPDF2 (3.0.1)
ipykernel (6.29.5)
jupyterlab (4.3.5)
notebook (7.3.2)
wheel (0.40.0)
setuptools (75.3.0)
pypdfium2 (4.30.1)
layoutparser (0.3.4)
torchvision (0.10.0+cu111)
detectron2 (0.5)
poppler-utils (0.1.0)
label-studio (1.13.1)
scikit-learn (1.3.2)
funcy (2.0)
openpyxl (3.1.5)
layoutparser[ocr] (0.3.4)
tesseract-ocr (0.3.13)
Pillow (9.5.0)

R 4.4.2 (2024-10-31) Packages:

readxl (1.4.3)
dplyr (1.1.4)
stringr (1.5.1)
tidyverse (2.0.0)
writexl (1.5.1)
ggplot2 (3.5.1)
lubridate (1.9.4)
stringdist (0.9.15)
grid (4.4.2)
ggforce (0.4.2)
RColorBrewer (1.1-3)
textreuse (0.1.5)
purrr (1.0.2)
quanteda (4.2.0)
gt (0.11.1)

Memory, Runtime, and Storage Requirements:
- Requires approximately 20 GB of free storage
- Requires approximately 45 hours of runtime
- Requires a GPU with at least 10 GB of memory
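
Before starting a run, it may help to confirm that the storage figure above is met. The sketch below is one possible check using Python's standard shutil module; the function names are hypothetical and are not part of the replication package.

```python
import shutil

REQUIRED_FREE_GB = 20  # storage figure stated in the requirements above

def free_space_gb(path="."):
    """Return the free space, in gigabytes, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1e9

def has_enough_space(path=".", required_gb=REQUIRED_FREE_GB):
    """True when the drive holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb
```

Running has_enough_space() from inside the replication folder checks the drive that will hold the model output.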

DESCRIPTION OF PROGRAMS/CODE

Step 1

- "Step 1/Step 1.sh" : This bash script creates subfolders for step 2 of the replication and runs the "01_Create_TestSet.ipynb" file in python. 

- "Step 1/Raw PDFS/01_Create_TestSet.ipynb" : This python script randomly selects one meeting record per year per city, converts each page of the selected pdf into an image, and saves the images to "Step 2/Training Set/".
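
The random selection described above can be made repeatable by fixing a seed. The following is a minimal sketch of that idea, not the notebook's actual code: the function name, the seed value, and the example records are all hypothetical.

```python
import random
from collections import defaultdict

def select_one_per_year(records, seed=2024):
    """Pick one meeting record per (city, year) group, reproducibly.

    `records` is a list of (city, year, filename) tuples.
    """
    rng = random.Random(seed)  # local generator so the seed is explicit
    groups = defaultdict(list)
    for city, year, filename in records:
        groups[(city, year)].append(filename)
    # Sort the groups and their files so the draw does not depend on input order.
    return {key: rng.choice(sorted(files)) for key, files in sorted(groups.items())}

# Hypothetical records; real file names come from "Step 1/Raw PDFS".
records = [
    ("CityA", 2019, "a_2019_01.pdf"),
    ("CityA", 2019, "a_2019_02.pdf"),
    ("CityA", 2020, "a_2020_01.pdf"),
    ("CityB", 2019, "b_2019_01.pdf"),
]
selection = select_one_per_year(records)
```

Using a local random.Random instance rather than the global generator keeps the draw independent of any other random calls made earlier in a notebook.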

Step 3

- "Step 3/Step 3.sh" : This bash script provides instructions for installing the correct docker container and the required python packages, and runs the "Model Training.ipynb" file in python.

- "Step 3/Model Training.ipynb" : This python script creates a folder system in "Step 3/Model Output" to host the output of the model training. Next it runs a python script that splits the annotated training set into a training and testing set. Finally, it runs a python script that executes the fine tuning of the object detection model with the training set and saves the output of the fine tuned model to the folders created in the first part of the script. 
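
The train/test split mentioned above can be pictured with the short sketch below. It is an illustration only: the 80/20 ratio, the seed, the function name, and the page names are assumptions, and the split script called by "Model Training.ipynb" may work differently.

```python
import random

def split_annotations(items, test_fraction=0.2, seed=0):
    """Shuffle `items` reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Number of items held out for testing, rounded to the nearest integer.
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical page-image names standing in for the annotated pages.
pages = [f"page_{i:03d}.png" for i in range(10)]
train_pages, test_pages = split_annotations(pages)
```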

Step 4

- "Step 4/Step 4.sh" : This bash script imports layoutparser, tells the user how to change the folder path depending on use case, and runs the "Data Extraction.ipynb" python script.

- "Step 4/Data Extraction.ipynb" : This python script applies the fine tuned models generated in Step 3 to every meeting record pdf for each city and extracts the identified agenda items. It does this for each of the 5 models trained (one per city, each trained with two meeting records per year) and for 5 different confidence threshold levels. It then saves the extracted agenda items as excel files in "Step 4/Data Extraction/".
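
To show what extracting at several confidence thresholds means in practice, the sketch below filters a list of scored detections at five cutoffs. The threshold values, function names, and example detections are hypothetical; this readme does not list the values used in the manuscript.

```python
# Hypothetical threshold values; the manuscript's values are not listed here.
THRESHOLDS = [0.5, 0.6, 0.7, 0.8, 0.9]

def filter_by_threshold(detections, threshold):
    """Keep the text of detections whose score meets or exceeds `threshold`."""
    return [text for text, score in detections if score >= threshold]

def extract_at_all_thresholds(detections, thresholds=THRESHOLDS):
    """Map each threshold to the list of agenda items that pass it."""
    return {t: filter_by_threshold(detections, t) for t in thresholds}

# Hypothetical (text, confidence score) pairs for one page.
detections = [("Approve minutes", 0.95), ("Call to order", 0.65), ("Footer", 0.40)]
results = extract_at_all_thresholds(detections)
```

Higher thresholds keep fewer items, which is why the threshold is recorded alongside each extracted item in the Step 4 output files.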
 
Step 5

- "Step 5/Legistar Comparison.R" : This R script loads the raw extracted agenda items, compares the count to the number identified on Legistar, creates a random subset for the handcoded validation process carried out in Step 6, and creates Appendix Figures 3, 4, and 7. The figures are saved in "Final Figures/".
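
The count comparison in this step is done in R. Purely as an illustration of the idea, the Python sketch below counts agenda items per meeting date for each source and lines the two counts up. The function names and example rows are hypothetical.

```python
from collections import Counter

def count_items_by_date(rows):
    """Count agenda items per meeting date from (date, text) pairs."""
    return Counter(date for date, _text in rows)

def compare_counts(odv_rows, slv_rows):
    """Return {date: (ODV count, SLV count)} for every date in either source."""
    odv, slv = count_items_by_date(odv_rows), count_items_by_date(slv_rows)
    return {d: (odv[d], slv[d]) for d in sorted(set(odv) | set(slv))}

# Hypothetical rows; ODV is the object detection output, SLV the Legistar export.
odv_rows = [("2020-01-07", "Approve minutes"), ("2020-01-07", "Budget hearing")]
slv_rows = [("2020-01-07", "Approve minutes"), ("2020-01-21", "Zoning appeal")]
comparison = compare_counts(odv_rows, slv_rows)
```

A date that appears in only one source gets a count of zero for the other, so gaps in either dataset remain visible.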
 
Step 6

- "Step 6/Hand Coded Comparison.R" : This R script imports the hand coded validation datasets described above, runs a lexical similarity comparison, and generates Figure 2 and Table 1 in the main manuscript. The outputs are saved in "Final Figures/".
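
This readme does not state which similarity measure "Hand Coded Comparison.R" uses. The sketch below therefore substitutes a simple word-level Jaccard similarity, written in Python, only to show what comparing a matched pair of agenda items can look like. The function names and the example pair are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def jaccard_similarity(text_a, text_b):
    """Share of distinct words common to both texts (0.0 to 1.0)."""
    words_a, words_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not words_a and not words_b:
        return 1.0  # two empty strings are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical matched pair (not taken from the manuscript data).
score = jaccard_similarity("Approve the minutes", "Approve minutes")
```

Here the two texts share two of three distinct words, so the score is 2/3.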

INSTRUCTIONS TO REPLICATORS

To carry out the replication for this manuscript, go through each step one at a time. When beginning a new step, open the "Step {Number}.sh" or txt file and follow the instructions within the script. These instructions include tasks that must be done separately from running the bash script, such as creating the docker container or changing the folder paths for a given use. Each python script in each step can be run through the bash script, and the python scripts should not need to be opened or edited unless the folder paths need updating. Every item in this replication package should be placed in a parent folder called "Replication Attempt 1" to make the replication process simple and keep the folder paths consistent. When extracting zipped files, extract them to the location of the zipped file and delete the zip file afterwards.
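
The zip handling described above can also be scripted. The sketch below is one way to do it with Python's standard zipfile and pathlib modules; the function name is hypothetical and is not part of the replication package.

```python
import zipfile
from pathlib import Path

def extract_and_remove(zip_path):
    """Extract a zip archive into the folder that contains it, then delete it."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(zip_path.parent)
    zip_path.unlink()  # remove the archive only after extraction succeeds
    return zip_path.parent
```

Because the archive is deleted only after extractall returns, a failed extraction leaves the zip file in place.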

Note on Stochastic Elements: 

There are several steps within the replication package where stochastic elements can lead to slightly divergent results. In Step 1, a random subset of the PDFs is selected for hand annotation through Label Studio. Setting a seed at this stage will help maintain replicability for future replications. The actual hand annotating that occurs in Step 2 can also introduce differences in the outputs. Slight differences in how the annotation boxes are drawn in Label Studio may lead to different object detection models and thus different model accuracy. In Step 3, additional stochastic elements are introduced in the training of the model. Identical selections of PDFs in Step 1 and identical hand annotations in Step 2 can still lead to slight differences in model outputs due to the fundamentally stochastic nature of the model training process. These differences should be minor, but cannot be removed by setting a seed. Finally, the selection of meeting days for the hand annotation comparisons at the end of Step 5 also needs a set seed to be replicable. Even with the same set of meeting days, however, the hand annotation of matched agenda items may differ from person to person depending on whether their decision criteria for matching are more or less stringent.

LIST OF TABLES AND FIGURES

- "Final Figures/Table 1.png" : This table is generated in "Step 6/Hand Coded Comparison.R" on line 355. (Same as Appendix Table 1.png)

- "Final Figures/Figure 2.png" : This figure is generated in "Step 6/Hand Coded Comparison.R" on line 344. (Same as Appendix Figure 8.png)

- "Final Figures/Appendix Figure 3.png" : This figure is generated in "Step 5/Legistar Comparison.R" on line 549.

- "Final Figures/Appendix Figure 4.png" : This figure is generated in "Step 5/Legistar Comparison.R" on line 614.

- "Final Figures/Appendix Figure 7.png" : This figure is generated in "Step 5/Legistar Comparison.R" on line 459.





